feat(rw-backend): site index rebuild (catalog/S3 → DB) via scan + worker queue #71
Merged
…ker queue
Maintain a fresh per-site DB projection of RW documentation sites so the backend
can answer queries (starting with the doc-comment inbox) without hitting the
catalog or S3 per request. Replaces a single long-running rebuild with a
resilient producer/queue/worker model (the pattern Backstage's catalog uses).
Tables (PostgreSQL; sqlite in dev):
- section_ownership: sparse section→entity claim links, from the catalog scan
- sections / pages: per-site structure registries, from S3 (listSections/listPages)
- site_refresh: the work queue (next_update_at doubles as due-time + claim lease)
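
To illustrate how one timestamp can serve as both due-time and claim lease, here is a minimal in-memory sketch. It is not the PR's implementation: the `siteId` field, the `site_id` column, the function names, and the lease/interval values are assumptions made for the example. The SQL in the comment shows one common PostgreSQL shape for such a claim; `SKIP LOCKED` is PostgreSQL-specific, so a sqlite dev setup would need a different statement.

```typescript
// In-memory model of a lease-based queue row. Only site_refresh and
// next_update_at come from the PR text; every other name is hypothetical.
interface SiteRefreshRow {
  siteId: string;       // hypothetical key
  nextUpdateAt: number; // epoch ms; models next_update_at
}

// One possible PostgreSQL claim statement (an assumption, not taken from the PR):
//
//   UPDATE site_refresh
//      SET next_update_at = now() + $1::interval      -- the lease
//    WHERE site_id IN (
//            SELECT site_id
//              FROM site_refresh
//             WHERE next_update_at <= now()            -- due rows only
//             ORDER BY next_update_at
//             LIMIT $2
//               FOR UPDATE SKIP LOCKED)                -- skip rows other workers hold
//   RETURNING site_id;

// Claim up to `batch` due rows by pushing their due-time forward by the lease.
// While the lease is in the future, other claimers do not see the row as due.
function claimDue(
  rows: SiteRefreshRow[],
  now: number,
  leaseMs: number,
  batch: number,
): string[] {
  const due = rows
    .filter((r) => r.nextUpdateAt <= now)
    .sort((a, b) => a.nextUpdateAt - b.nextUpdateAt)
    .slice(0, batch);
  for (const r of due) {
    r.nextUpdateAt = now + leaseMs;
  }
  return due.map((r) => r.siteId);
}

// After a successful refresh, schedule the next regular run.
function markRefreshed(
  rows: SiteRefreshRow[],
  siteId: string,
  now: number,
  refreshIntervalMs: number,
): void {
  const row = rows.find((r) => r.siteId === siteId);
  if (row) {
    row.nextUpdateAt = now + refreshIntervalMs;
  }
}
```

If a worker crashes mid-refresh, nothing needs cleaning up in this model: the row simply becomes due again once the lease passes.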
Pipeline:
- rw-site-index-scan (global): iterate rwdocs.org/ref-annotated catalog entities;
per site, atomically swap section_ownership links + upsert the queue row; prune
only after a clean, fully-successful scan.
- rw-site-index-worker (local): claim due sites (FOR UPDATE SKIP LOCKED + lease),
load each from S3, swap sections/pages registries; skip the write when a content
hash is unchanged; bounded concurrency via p-limit; per-site errors isolated.
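
The worker behaviour listed above (content-hash short-circuit, bounded concurrency, per-site error isolation) can be sketched roughly as follows. This is an illustrative sketch under assumptions, not the PR's code: `LoadSite`, `refreshSites`, `mapLimited`, the `Outcome` type, and the in-memory `storedHashes` map are hypothetical, and `mapLimited` is a small stand-in for p-limit so that the example has no third-party dependencies.

```typescript
import { createHash } from "node:crypto";

type SiteContent = { sections: string[]; pages: string[] };

// Hypothetical loader signature; the real worker reads each site from S3.
type LoadSite = (siteId: string) => Promise<SiteContent>;

type Outcome = "written" | "unchanged" | "failed";

// Stable digest of a site's structure, used to detect "nothing changed".
function contentHash(content: SiteContent): string {
  return createHash("sha256")
    .update(JSON.stringify([content.sections, content.pages]))
    .digest("hex");
}

// Stand-in for p-limit: run `fn` over `items` with at most `limit` in flight.
async function mapLimited<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function runner(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}

async function refreshSites(
  siteIds: string[],
  load: LoadSite,
  storedHashes: Map<string, string>, // models the hash persisted per site
  concurrency: number,
): Promise<Map<string, Outcome>> {
  const outcomes = await mapLimited(siteIds, concurrency, async (siteId): Promise<Outcome> => {
    try {
      const content = await load(siteId);
      const hash = contentHash(content);
      if (storedHashes.get(siteId) === hash) {
        return "unchanged"; // short-circuit: skip the registry write
      }
      // A real worker would swap the sections/pages rows in one transaction here.
      storedHashes.set(siteId, hash);
      return "written";
    } catch {
      return "failed"; // one site's error does not abort the batch
    }
  });
  const result = new Map<string, Outcome>();
  siteIds.forEach((id, i) => result.set(id, outcomes[i]));
  return result;
}
```

Catching inside the per-site task, rather than around the whole batch, is what keeps a single failing site from rejecting the combined promise.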
Also: shared iterateAnnotatedEntities in rw-common (search collator refactored to
consume it); rw.siteIndex.{schedule,worker} config; info/debug logging. Read-time
ownership roll-up and the inbox endpoint/UI land in a follow-up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
36c872f to b9ca7ff
Splits the doc-comment inbox's ownership rebuild into a dedicated, reworked subsystem: a site index that keeps a fresh per-site DB projection of RW docs sites (structure + ownership), so the backend can answer queries without hitting the catalog/S3 per request. Replaces the single long-running job with a Backstage-catalog-style producer/queue/worker.
Tables (PostgreSQL; sqlite in dev)
- section_ownership: sparse section→entity claim links, from the catalog scan
- sections / pages: per-site structure registries, from S3 (listSections/listPages)
- site_refresh: work queue (next_update_at = due-time + claim lease)

Pipeline
- rw-site-index-scan (global): catalog → per-site atomic section_ownership swap + queue upsert; prune only after a clean, fully-successful scan.
- rw-site-index-worker (local): claim due sites (FOR UPDATE SKIP LOCKED + lease) → load from S3 → swap sections/pages; content-hash short-circuit; p-limit concurrency; per-site error isolation.

Also: shared iterateAnnotatedEntities in rw-common (search collator refactored to consume it); rw.siteIndex.{schedule,worker} config; info/debug logging.

Scope / follow-ups
- @rwdocs/core >= 0.1.28 (carries RwSite.listSections/listPages).

Notes for reviewers
- comments/router integration suites can time out under heavy local load (unrelated to this PR).

🤖 Generated with Claude Code
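
For reviewers who want a concrete picture of the scan's "prune only after a clean, fully-successful scan" rule from the Pipeline section, here is a small in-memory sketch. It rests on assumptions: the `SiteClaims` shape, the `runScan` name, and the idea that the scan yields one complete batch of links per site are all hypothetical and not taken from the PR.

```typescript
// Models section_ownership as siteId -> set of claim links.
type Ownership = Map<string, Set<string>>;

// Hypothetical shape: one complete batch of claim links per site.
interface SiteClaims {
  siteId: string;
  links: string[];
}

function runScan(
  table: Ownership,
  scan: () => Iterable<SiteClaims>, // may throw mid-iteration (e.g. catalog error)
): { clean: boolean; pruned: string[] } {
  const seen = new Set<string>();
  let clean = true;
  try {
    for (const { siteId, links } of scan()) {
      // Per-site swap: replace the site's whole link set in one step.
      table.set(siteId, new Set(links));
      seen.add(siteId);
    }
  } catch {
    clean = false; // a partial scan cannot prove that a site is gone
  }
  const pruned: string[] = [];
  if (clean) {
    // Only a complete scan justifies deleting sites that were not seen.
    for (const siteId of Array.from(table.keys())) {
      if (!seen.has(siteId)) {
        table.delete(siteId);
        pruned.push(siteId);
      }
    }
  }
  return { clean, pruned };
}
```

Skipping the prune after a failed scan trades some staleness for safety: a transient catalog error cannot wipe ownership rows for sites the scan never reached.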